# Exploiting Memory Hierarchy: Main Memory, Associative Cache

Dr. Vincent C. Emeakaroha

13-03-2017

vc.emeakaroha@cs.ucc.ie

## Write Allocation

- What should happen on a write miss?
  - Write allocate: allocate a cache block for the written address
  - Fetch the memory block and overwrite the appropriate portion
- Alternatives for write-through
  - Update only the written portion of the block in memory (no write allocate)
  - Write around: don't fetch the block
    - Programs often write a whole block before reading it (e.g., initialization)
- For write-back
  - Usually fetch the block
  - Use write buffer to avoid overwrite
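
The difference between the two write-miss policies can be sketched with a toy model. This is a minimal illustration, not a description of any real cache: the class `ToyCache` and its counters are hypothetical names, the model tracks only which block numbers are resident, and it ignores capacity, eviction, and the memory traffic caused by write hits.

```python
class ToyCache:
    """Tracks resident block numbers only; capacity and eviction are ignored."""

    def __init__(self, write_allocate):
        self.write_allocate = write_allocate
        self.resident = set()       # block numbers currently in the cache
        self.fetches_on_miss = 0    # blocks fetched from memory on write misses
        self.writes_around = 0      # write misses sent to memory without a fetch

    def write(self, block):
        if block in self.resident:
            return                  # write hit: nothing to decide here
        if self.write_allocate:
            # Write allocate: fetch the block, then overwrite part of it.
            self.fetches_on_miss += 1
            self.resident.add(block)
        else:
            # Write around: update memory only, leave the cache untouched.
            self.writes_around += 1


# Initialization-style pattern: write all 4 words of blocks 0..3, one word at a time.
pattern = [block for block in range(4) for _word in range(4)]

allocate = ToyCache(write_allocate=True)
around = ToyCache(write_allocate=False)
for b in pattern:
    allocate.write(b)
    around.write(b)

print(allocate.fetches_on_miss, around.fetches_on_miss, around.writes_around)
```

Under these assumptions, write allocate fetches each block once even though every word of it is about to be overwritten, while write around fetches nothing and sends every write to memory.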

## Example: Intrinsity FastMATH

- Embedded MIPS processor
  - 12-stage pipeline
  - Instruction and data access on each cycle
- Split cache: separate I-cache and D-cache
  - Each 16KB: 256 blocks × 16 words/block
  - D-cache: write-through or write-back
- SPEC2000 miss rates
  - Instruction-cache: 0.4%
  - Data-cache: 11.4%
  - Weighted average: 3.2%
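
The cache geometry above fixes how an address is divided into tag, index, and offset fields. The sketch below works this out, assuming 32-bit byte addresses and 4-byte words (neither is stated on the slide); `split_address` is a hypothetical helper name.

```python
BLOCKS = 256
WORDS_PER_BLOCK = 16
BYTES_PER_WORD = 4      # assumption: 32-bit words
ADDRESS_BITS = 32       # assumption: 32-bit byte addresses

cache_bytes = BLOCKS * WORDS_PER_BLOCK * BYTES_PER_WORD

# Field widths, derived from the power-of-two sizes above.
byte_offset_bits = (BYTES_PER_WORD - 1).bit_length()
block_offset_bits = (WORDS_PER_BLOCK - 1).bit_length()
index_bits = (BLOCKS - 1).bit_length()
tag_bits = ADDRESS_BITS - index_bits - block_offset_bits - byte_offset_bits


def split_address(addr):
    """Return (tag, index, word-within-block) for a byte address."""
    word = (addr >> byte_offset_bits) & (WORDS_PER_BLOCK - 1)
    index = (addr >> (byte_offset_bits + block_offset_bits)) & (BLOCKS - 1)
    tag = addr >> (byte_offset_bits + block_offset_bits + index_bits)
    return tag, index, word


print(cache_bytes, tag_bits, index_bits, block_offset_bits, byte_offset_bits)
print(split_address(0x12345678))
```

With these assumptions, the 16KB figure follows from 256 × 16 × 4 bytes, and the address splits into an 18-bit tag, an 8-bit index, a 4-bit word offset, and a 2-bit byte offset.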


## Main Memory Supporting Caches

- Use DRAMs for main memory
  - Fixed width (e.g., 1 word)
  - Connected by fixed-width clocked bus
    - Bus clock is typically slower than CPU clock
- Example cache block read
  - 1 bus cycle for address transfer
  - 15 bus cycles per DRAM access
  - 1 bus cycle per data transfer
- For 4-word block, 1-word-wide DRAM
  - Miss penalty = $1 + 4 \times 15 + 4 \times 1 = 65$ bus cycles
  - Bandwidth = 16 bytes / 65 cycles ≈ 0.25 bytes/cycle
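
The arithmetic above can be checked with a short sketch. The function names are hypothetical, and the 4-byte word size is an assumption implied by "16 bytes" for a 4-word block.

```python
def miss_penalty_cycles(words_per_block, address_cycles=1,
                        dram_cycles=15, transfer_cycles=1):
    """Bus cycles to read one block from a 1-word-wide DRAM, one word at a time."""
    return (address_cycles
            + words_per_block * dram_cycles
            + words_per_block * transfer_cycles)


def bandwidth_bytes_per_cycle(words_per_block, bytes_per_word=4):
    """Bytes delivered per bus cycle during a block read (4-byte words assumed)."""
    block_bytes = words_per_block * bytes_per_word
    return block_bytes / miss_penalty_cycles(words_per_block)


penalty = miss_penalty_cycles(4)
bandwidth = bandwidth_bytes_per_cycle(4)
print(penalty, round(bandwidth, 2))
```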

## Measuring Cache Performance

- Components of CPU time
  - Program execution cycles
    - Includes cache hit time
  - Memory stall cycles
    - Mainly from cache misses
- With simplifying assumptions:

$$\text{Memory stall cycles} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty}$$

## Cache Performance Example

- Given
  - I-cache miss rate = 2%
  - D-cache miss rate = 4%
  - Miss penalty = 100 cycles
  - Base CPI (ideal cache) = 2
  - Loads and stores are 36% of instructions
- Miss cycles per instruction
  - I-cache: $0.02 \times 100 = 2$
  - D-cache: $0.36 \times 0.04 \times 100 = 1.44$
- Actual CPI = 2 + 2 + 1.44 = 5.44
  - A CPU with an ideal cache is 5.44/2 = 2.72 times faster
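
The same calculation as a small sketch, using the numbers given above. The function name is hypothetical, and the model assumes every instruction makes exactly one I-cache access.

```python
def miss_cycles_per_instruction(accesses_per_instruction, miss_rate, miss_penalty):
    """Average stall cycles per instruction contributed by one cache."""
    return accesses_per_instruction * miss_rate * miss_penalty


BASE_CPI = 2.0
MISS_PENALTY = 100

# Every instruction is fetched once; 36% of instructions also access data.
icache_stalls = miss_cycles_per_instruction(1.0, 0.02, MISS_PENALTY)
dcache_stalls = miss_cycles_per_instruction(0.36, 0.04, MISS_PENALTY)

actual_cpi = BASE_CPI + icache_stalls + dcache_stalls
speedup_with_ideal_cache = actual_cpi / BASE_CPI

print(round(actual_cpi, 2), round(speedup_with_ideal_cache, 2))
```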

## Average Access Time

- Hit time is also important for performance
- Average Memory Access Time (AMAT)
  - AMAT = Hit time + Miss rate × Miss penalty
- Example
  - CPU with 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  - AMAT = $1 + 0.05 \times 20 = 2$ cycles = 2 ns
    - That is, 2 cycles per instruction
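
A minimal sketch of the AMAT formula, applied to the example above. The function and constant names are hypothetical.

```python
def amat_cycles(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


CLOCK_NS = 1.0  # 1 ns clock period, from the example

cycles = amat_cycles(hit_time=1, miss_rate=0.05, miss_penalty=20)
nanoseconds = cycles * CLOCK_NS
print(cycles, nanoseconds)
```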

## Performance Summary

- When CPU performance increases
  - Miss penalty becomes more significant
- Decreasing base CPI
  - Greater proportion of time spent on memory stalls
- Increasing clock rate
  - Memory stalls account for more CPU cycles
- Can't neglect cache behavior when evaluating system performance
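
These trends can be illustrated by reusing the miss rates from the cache performance example (2% I-cache, 4% D-cache, 36% loads and stores). The halved base CPI and the doubled clock rate are hypothetical scenarios, and the sketch assumes memory latency in nanoseconds stays fixed, so doubling the clock rate doubles the miss penalty in cycles.

```python
def stall_cpi(miss_penalty):
    """Stall cycles per instruction for the example's miss rates."""
    return 0.02 * miss_penalty + 0.36 * 0.04 * miss_penalty


def stall_fraction(base_cpi, stalls):
    """Fraction of all execution cycles spent in memory stalls."""
    return stalls / (base_cpi + stalls)


baseline = stall_fraction(2.0, stall_cpi(100))       # base CPI 2, 100-cycle penalty
lower_cpi = stall_fraction(1.0, stall_cpi(100))      # hypothetical: base CPI halved
faster_clock = stall_fraction(2.0, stall_cpi(200))   # hypothetical: clock rate doubled

print(round(baseline, 2), round(lower_cpi, 2), round(faster_clock, 2))
```

In both hypothetical scenarios the share of cycles lost to memory stalls rises above the baseline, which is the point of the summary.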

## Associative Caches

- Fully associative
  - Allow a given block to go in any cache entry
  - Requires all entries to be searched at once
  - Comparator per entry (expensive)
- *n*-way set associative
  - Each set contains *n* entries
  - Block number determines which set
    - (Block number) modulo (#Sets in cache)
  - Search all entries in a given set at once
  - *n* comparators (less expensive)
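
The set-selection rule can be sketched as follows. The function name `set_index` and the sample block number 12 are illustrative choices; direct mapped and fully associative fall out as the 1-way and all-ways cases.

```python
def set_index(block_number, num_entries, ways):
    """Set that a block maps to: (block number) modulo (number of sets)."""
    if num_entries % ways != 0:
        raise ValueError("ways must divide the number of entries")
    num_sets = num_entries // ways
    return block_number % num_sets


# Where block 12 may go in an 8-entry cache, for each associativity.
for ways in (1, 2, 4, 8):
    print(f"{ways}-way: set {set_index(12, 8, ways)} of {8 // ways}")
```

Within the selected set, the block may occupy any of the `ways` entries, so that many tag comparisons are needed per lookup.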

## Associative Cache Example



## Spectrum of Associativity

For a cache with 8 entries

#### One-way set associative (direct mapped)

| Block | Tag | Data |
|-------|-----|------|
| 0     |     |      |
| 1     |     |      |
| 2     |     |      |
| 3     |     |      |
| 4     |     |      |
| 5     |     |      |
| 6     |     |      |
| 7     |     |      |

#### Two-way set associative

| Set | Tag | Data | Tag | Data |
|-----|-----|------|-----|------|
| 0   |     |      |     |      |
| 1   |     |      |     |      |
| 2   |     |      |     |      |
| 3   |     |      |     |      |

#### Four-way set associative

| Set | Tag | Data | Tag | Data | Tag | Data | Tag | Data |
|-----|-----|------|-----|------|-----|------|-----|------|
| 0   |     |      |     |      |     |      |     |      |
| 1   |     |      |     |      |     |      |     |      |

#### Eight-way set associative (fully associative)

| Tag | Data | Tag | Data | Tag | Data | Tag | Data | Tag | Data | Tag | Data | Tag | Data | Tag | Data |
|-----|------|-----|------|-----|------|-----|------|-----|------|-----|------|-----|------|-----|------|
|     |      |     |      |     |      |     |      |     |      |     |      |     |      |     |      |

## Associativity Example

- Compare 4-block caches
  - Direct mapped, 2-way set associative, fully associative
  - Block access sequence: 0, 8, 0, 6, 8
- Direct mapped

| Block address | Cache index | Hit/miss | Content after access: block 0 | Block 1 | Block 2 | Block 3 |
|---------------|-------------|----------|-------------------------------|---------|---------|---------|
| 0             | 0           | miss     | Mem[0]                        |         |         |         |
| 8             | 0           | miss     | Mem[8]                        |         |         |         |
| 0             | 0           | miss     | Mem[0]                        |         |         |         |
| 6             | 2           | miss     | Mem[0]                        |         | Mem[6]  |         |
| 8             | 0           | miss     | Mem[8]                        |         | Mem[6]  |         |

- 2-way set associative

| Block address | Cache index | Hit/miss | Content after access: set 0, way 0 | Set 0, way 1 | Set 1, way 0 | Set 1, way 1 |
|---------------|-------------|----------|------------------------------------|--------------|--------------|--------------|
| 0             | 0           | miss     | Mem[0]                             |              |              |              |
| 8             | 0           | miss     | Mem[0]                             | Mem[8]       |              |              |
| 0             | 0           | hit      | Mem[0]                             | Mem[8]       |              |              |
| 6             | 0           | miss     | Mem[0]                             | Mem[6]       |              |              |
| 8             | 0           | miss     | Mem[8]                             | Mem[6]       |              |              |

- Fully associative

| Block address | Hit/miss | Content after access: entry 0 | Entry 1 | Entry 2 | Entry 3 |
|---------------|----------|-------------------------------|---------|---------|---------|
| 0             | miss     | Mem[0]                        |         |         |         |
| 8             | miss     | Mem[0]                        | Mem[8]  |         |         |
| 0             | hit      | Mem[0]                        | Mem[8]  |         |         |
| 6             | miss     | Mem[0]                        | Mem[8]  | Mem[6]  |         |
| 8             | hit      | Mem[0]                        | Mem[8]  | Mem[6]  |         |
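
The three traces can be reproduced with a small simulator. This is a sketch assuming LRU replacement within each set, which is consistent with the 2-way trace; the function name `simulate` is hypothetical, and only block numbers are tracked, not data.

```python
from collections import OrderedDict


def simulate(accesses, num_blocks, ways):
    """Return 'hit'/'miss' for each block access in a set-associative LRU cache."""
    num_sets = num_blocks // ways
    # One OrderedDict per set, ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    results = []
    for block in accesses:
        entries = sets[block % num_sets]
        if block in entries:
            entries.move_to_end(block)         # mark as most recently used
            results.append("hit")
        else:
            if len(entries) == ways:
                entries.popitem(last=False)    # evict the least recently used block
            entries[block] = True
            results.append("miss")
    return results


SEQUENCE = [0, 8, 0, 6, 8]
direct_mapped = simulate(SEQUENCE, num_blocks=4, ways=1)
two_way = simulate(SEQUENCE, num_blocks=4, ways=2)
fully_associative = simulate(SEQUENCE, num_blocks=4, ways=4)

for name, trace in [("direct mapped", direct_mapped),
                    ("2-way", two_way),
                    ("fully associative", fully_associative)]:
    print(name, trace, "misses:", trace.count("miss"))
```

For this sequence the simulator reports 5, 4, and 3 misses respectively, matching the tables.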

## How Much Associativity

- Increased associativity decreases miss rate
  - But with diminishing returns
- Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
  - 1-way: 10.3%
  - 2-way: 8.6%
  - 4-way: 8.3%
  - 8-way: 8.1%
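
The diminishing returns are easy to see by computing the miss-rate reduction at each doubling of associativity. The sketch below uses only the four simulated figures listed above; the variable names are hypothetical.

```python
# Miss rate (%) by associativity, from the simulation results above.
miss_rates = {1: 10.3, 2: 8.6, 4: 8.3, 8: 8.1}

ways = sorted(miss_rates)
# Percentage-point reduction gained by each doubling of associativity.
gains = {
    (a, b): round(miss_rates[a] - miss_rates[b], 1)
    for a, b in zip(ways, ways[1:])
}

for (a, b), gain in gains.items():
    print(f"{a}-way -> {b}-way: {gain} percentage points")
```

Most of the benefit comes from the first step, from 1-way to 2-way; each later doubling buys only a few tenths of a percentage point.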

## Set Associative Cache Organization



## Replacement Policy

- Direct mapped: no choice
- Set associative
  - Prefer non-valid entry, if there is one
  - Otherwise, choose among entries in the set
- Least-recently used (LRU)
  - Choose the one unused for the longest time
    - Simple for 2-way, manageable for 4-way, too hard beyond that
- Random
  - Gives approximately the same performance as LRU for high associativity
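
Victim selection within one set can be sketched as below. This is an illustrative software model, not how hardware implements it: the function `choose_victim`, the per-way `valid` flags, and the `last_used` timestamps are all hypothetical.

```python
import random


def choose_victim(valid, last_used, policy="lru", rng=random):
    """Return the index of the way to replace within one set.

    valid[w] says whether way w holds a block; last_used[w] is the time
    of its most recent access (larger means more recent).
    """
    # Prefer a non-valid entry, if there is one.
    for way, is_valid in enumerate(valid):
        if not is_valid:
            return way
    if policy == "lru":
        # Least recently used: the way with the oldest access time.
        return min(range(len(valid)), key=lambda w: last_used[w])
    if policy == "random":
        return rng.randrange(len(valid))
    raise ValueError(f"unknown policy: {policy}")


# A 4-way set with one empty way: the empty way is chosen regardless of policy.
print(choose_victim([True, False, True, True], [5, 0, 3, 9]))
# A full set under LRU: the way with the oldest timestamp is chosen.
print(choose_victim([True, True, True, True], [5, 2, 3, 9]))
```

Exact LRU needs an ordering of all ways in the set to be maintained on every access, which is why it becomes costly as associativity grows, whereas random selection needs no per-access bookkeeping.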